Abstract

Multi-class classification is one of the most important tasks in ma- 
chine learning. In this paper we consider two online multi-class classi- 
fication problems: classification by a linear model and by a kernelized 
model. The quality of predictions is measured by the Brier loss function. 
We suggest two computationally efficient algorithms to work with these
problems and prove theoretical guarantees on their losses. We kernelize
one of the algorithms and prove theoretical guarantees on its loss. We 
perform experiments and compare our algorithms with logistic regression. 

1 Introduction 



Online prediction is a wide area of machine learning (see Cesa-Bianchi and Lugosi).
Its algorithms can be applied to different data mining problems (see, for
example, Freund and Schapire, 1997). Online prediction provides efficient
algorithms which adapt to a predicted process "on the fly". In the online regression
framework we assume the existence of some input at each step and try to predict
an outcome on this input. This process is repeated step by step. We consider the
multi-dimensional Brier game where outcomes and predictions come from a
simplex and can be thought of as probability distributions on the vertices of the
simplex. If the outcomes are identified with vertices of the simplex, this problem
can be thought of as the multi-class classification problem of the given input.

In the simple case the dependence between the input and its outcome is
assumed to be linear; linear regression minimising the expected loss is studied
in statistics. In contrast to the traditional statistical setting, the learner in
online prediction does not make any statistical assumptions about the
data-generating process. Its goal is to predict as well as the best linear function of the
input. Instead of looking for the best linear function, our learner considers all
linear functions and makes its prediction by mixing them in a certain way at
each prediction step. We prove theoretical bounds on the cumulative loss of the
learner in comparison with the cumulative loss of the best linear function (we
say the learner competes with these functions). We consider the square loss:
mean square error is one of the benchmark measures for classification algorithms
(see Brier, 1950).

We use Vovk's Aggregating Algorithm (a generalization of the Bayesian
mixture) to mix functions (as in Aggregating Algorithm Regression, AAR: see Vovk,
2001). This method has previously been applied to the case when possible
outcomes lie in a segment of the real line, and so the prediction was one-dimensional.
We develop two algorithms to solve the problem of multi-dimensional
prediction. The first algorithm applies a variant of AAR to predict each coordinate of
the outcome separately, and then combines these predictions in a certain way to
get a probability prediction. The other algorithm is designed to give probability
predictions directly; these are the first computationally efficient online regression
algorithms designed to solve linear and non-linear multi-class classification
problems. We derive theoretical bounds on the losses of both algorithms. We come
to the unexpected conclusion that the component-wise algorithm is better than
the second one asymptotically, but worse in the beginning of the prediction
process. Their performance on benchmark data sets is very similar.

One component of the prediction of the second algorithm has the meaning
of a remainder. In practice this situation is quite common. For example, in a
football match either one team wins or the other, and the remainder is a draw
(see Vovk and Zhdanov (2008) for online prediction experiments in football).
When we analyse a precious metal alloy we may look for a description of the
following kind: the alloy has 40% of gold, 35% of silver, and some addition
(e.g., copper and palladium). It is common in financial applications to predict
the direction of the price: the price can go up, down, or stay close to the
current value. We perform classification experiments with linear algorithms
and compare them with logistic regression.

A description of the framework can be found in Section 2, a description of the
algorithms can be found in Section 3, and the derivation of the theoretical bounds
can be found in Section 4.

We look for a way to extend the class of experts using the kernel trick. We
kernelize the second algorithm and prove a theoretical bound on its loss. The
cumulative loss of the kernelized algorithm is compared with the cumulative loss
of any finite set of functions from the RKHS given by the kernel it uses.
The kernelization process is described in Section 5. Our experiments are shown in
Section 6. Section 7 draws the conclusions and shows some possibilities for
prospective work.



2 Framework 

A game of prediction contains three components: a space of outcomes $\Omega$, a
decision space $\Gamma$, and a loss function $\lambda : \Omega \times \Gamma \to \mathbb{R}$. We are interested in the
generalisation of the Brier game from Brier (1950) where the space of outcomes
$\Omega = \mathcal{P}(\Sigma)$ is the set of all probability measures on a finite set $\Sigma$ with $d$ elements,
$\Gamma := \{(\gamma_1, \dots, \gamma_d) : \sum_{i=1}^d \gamma_i = 1, \gamma_i \in \mathbb{R}\}$ is a hyperplane in $d$-dimensional space
containing all the outcomes, and for any $y \in \Omega$ we define the loss

$$\lambda(y, \gamma) = \sum_{\sigma \in \Sigma} (\gamma\{\sigma\} - y\{\sigma\})^2.$$

For example, if $\Sigma = \{1, 2, 3\}$, the outcome is concentrated at $\omega = 1$, $\gamma\{1\} = 1/2$, $\gamma\{2\} = 1/4$, and $\gamma\{3\} = 1/4$,
then $\lambda(\omega, \gamma) = (1/2 - 1)^2 + (1/4 - 0)^2 + (1/4 - 0)^2 = 3/8$. Brier loss is one of the most
important loss functions used to assess the quality of classification algorithms.
The game of prediction is played repeatedly by a learner receiving some
input vectors $x_t \in X \subseteq \mathbb{R}^n$, and follows Protocol 1.



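Protocol 1 Protocol of forecasting game
  $L_0 := 0$.
  for $t = 1, 2, \dots$ do
    Reality announces a signal $x_t \in X \subseteq \mathbb{R}^n$.
    Learner announces $\gamma_t \in \Gamma \subseteq \mathbb{R}^d$.
    Reality announces $y_t \in \Omega \subseteq \mathbb{R}^d$.
    $L_t := L_{t-1} + \lambda(y_t, \gamma_t)$.
  end for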
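The protocol and the Brier loss can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `brier_loss`, `play` and `uniform_learner` are hypothetical, and the learner used here is a fixed placeholder that ignores the signal.

```python
def brier_loss(outcome, prediction):
    """Brier loss: squared Euclidean distance between outcome and prediction."""
    return sum((p - y) ** 2 for y, p in zip(outcome, prediction))


def play(signals, outcomes, learner):
    """Run the forecasting protocol and return the learner's cumulative loss."""
    cumulative = 0.0
    for x, y in zip(signals, outcomes):
        gamma = learner(x)                   # Learner announces a prediction
        cumulative += brier_loss(y, gamma)   # Reality announces y, loss is added
    return cumulative


def uniform_learner(x, d=3):
    """Hypothetical placeholder learner: always predicts the uniform distribution."""
    return [1.0 / d] * d


# The worked example from the text: outcome concentrated at 1, prediction (1/2, 1/4, 1/4).
example = brier_loss((1.0, 0.0, 0.0), (0.5, 0.25, 0.25))
```

Any prediction strategy with the same call signature can be passed to `play` in place of `uniform_learner`.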



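We find an algorithm which is capable of competing with all linear functions
(we call them experts) $\xi_t = (\xi_t^1, \dots, \xi_t^d)'$ of $x$:

$$\begin{aligned}
\xi_t^1 &= 1/d + \alpha_1' x_t \\
&\;\;\vdots \\
\xi_t^{d-1} &= 1/d + \alpha_{d-1}' x_t \\
\xi_t^d &= 1 - \xi_t^1 - \dots - \xi_t^{d-1} = 1/d - \Big(\sum_{i=1}^{d-1} \alpha_i\Big)' x_t
\end{aligned} \qquad (1)$$

where $\alpha_i = (\alpha_i^1, \dots, \alpha_i^n)'$, $i = 1, \dots, d-1$. In the model (1) the prediction for
the last component of an outcome is calculated from the predictions for the other
components. Denote $\alpha = (\alpha_1', \dots, \alpha_{d-1}')' \in \Theta = \mathbb{R}^{n(d-1)}$. Then any expert
can be presented as $\xi_t = \xi_t(\alpha)$. Let also $L_T(\alpha) = \sum_{t=1}^T \lambda(y_t, \xi_t(\alpha))$ be the
cumulative loss of an expert $\alpha$ over $T$ trials.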

3 Derivation of the algorithms 

In this section we describe how we apply the Aggregating Algorithm (AA)
proposed in Vovk (1990) to mix experts and make predictions. The algorithm keeps
weights $P_{t-1}(d\alpha)$ for the experts at each prediction step $t$, and updates them
by the exponential weighting scheme after the actual outcome is announced:

$$P_t(d\alpha) = \beta^{\lambda(y_t, \xi_t(\alpha))} P_{t-1}(d\alpha), \quad \beta \in (0, 1). \qquad (2)$$

Here $\beta = e^{-\eta}$, where $\eta \in (0, \infty)$ is a learning rate parameter. This weight
update ensures that the experts which predict badly at the step $t$ receive less
weight. The weights are then normalized: $P_t^*(d\alpha) = P_t(d\alpha) / \int_\Theta P_t(d\alpha)$.

The prediction of the algorithm is a combination of the experts' predictions.
It is suggested in Kivinen and Warmuth that the prediction is simply
the weighted average of the experts' predictions with weights $P_{t-1}^*(d\alpha)$. The
Aggregating Algorithm uses a more sophisticated prediction scheme, and sometimes
achieves better theoretical performance. It first defines a generalised prediction
at any step $t$ as a function $g_t : \Omega \to \mathbb{R}$ such that

$$g_t(y) = \log_\beta \int_\Theta \beta^{\lambda(y, \xi_t(\alpha))} P_{t-1}^*(d\alpha) \qquad (3)$$

for all $y \in \Omega$. It is a weighted average (in a general sense) of the experts' losses
for each possible outcome. It then predicts any $\gamma_t$ such that

$$\lambda(y, \gamma_t) \le g_t(y) \qquad (4)$$

for all possible $y \in \Omega$. If such a prediction can be found for any weights
distribution on experts, the game is called perfectly mixable. Perfectly mixable games
and other types of games are analyzed in Vovk (1998). It is also shown there
that for a countable (and thus finite) number of experts the AA achieves the best
possible theoretical guarantees.
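For a finite pool of experts the integral in the generalised prediction becomes a sum, and the weight update and the generalised prediction can be sketched directly. The following is a minimal illustration assuming finitely many experts with fixed predictions; the function names are hypothetical and the paper itself works with a continuum of experts.

```python
import math


def brier_loss(outcome, prediction):
    """Brier loss: squared Euclidean distance between outcome and prediction."""
    return sum((p - y) ** 2 for y, p in zip(outcome, prediction))


def update_weights(weights, expert_predictions, outcome, eta=1.0):
    """Exponential weighting: multiply each weight by beta**loss, beta = exp(-eta)."""
    return [w * math.exp(-eta * brier_loss(outcome, xi))
            for w, xi in zip(weights, expert_predictions)]


def generalised_prediction(weights, expert_predictions, outcome, eta=1.0):
    """g(y) = log_beta of the normalized weighted sum of beta**loss(y, expert)."""
    total = sum(weights)
    mix = sum(w / total * math.exp(-eta * brier_loss(outcome, xi))
              for w, xi in zip(weights, expert_predictions))
    return -math.log(mix) / eta   # log_beta(x) = ln(x) / ln(beta) = -ln(x) / eta


# Two experts that each put all probability on one of two classes.
experts = [(1.0, 0.0), (0.0, 1.0)]
g = generalised_prediction([1.0, 1.0], experts, (1.0, 0.0))
new_weights = update_weights([1.0, 1.0], experts, (1.0, 0.0))
```

The value `g` lies between the smallest and the largest expert loss for the given outcome, and the expert that predicted the outcome correctly keeps the larger weight.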

3.1 Proof of mixability 

In this section we prove that our game is perfectly mixable and show a function
that can be used to give predictions satisfying (4).

It is shown in Theorem 1 of Vovk and Zhdanov (2008) that the Brier game with a
finite number of outcomes is perfectly mixable iff $\eta \in (0, 1]$. The two authors of
that paper consider the outcome space of $d$ probability measures concentrated
in points of $\Sigma$. We denote this space by $\mathcal{R}(\Sigma)$. They consider experts giving
predictions from all probability measures $\mathcal{P}(\Sigma)$. We need to prove that the
inequality (4) holds for our experts (1) (who can give predictions outside of the
probability simplex) and our outcome space (the whole probability simplex,
not only its vertices). Lemma 2 describes the first part, but first we need to
state an additional statement. The following lemma shows that any vector from
$\mathbb{R}^d$ can be projected onto the simplex without increasing the Brier loss.

Lemma 1. For any $\xi = (\xi_1, \dots, \xi_d) \in \mathbb{R}^d$ there exists $\theta = (\theta_1, \dots, \theta_d) \in \mathcal{P}(\Sigma)$
such that for any $y \in \Omega$ we have $\lambda(y, \theta) \le \lambda(y, \xi)$.

Proof. The Brier loss of a prediction $\gamma$ is the squared Euclidean distance between
$\gamma$ and the actual outcome $y$ in a $d$-dimensional space. The proof follows from
the fact that $\Omega$ is a convex and closed set in $\mathbb{R}^d$. □

Lemma 2. Let $P(d\alpha)$ be any probability distribution on $\Theta$. Then for any $\eta \in (0, 1]$
there exists $\gamma \in \Gamma$ such that for any $y \in \mathcal{R}(\Sigma)$ we have

$$\lambda(y, \gamma) \le \log_\beta \int_\Theta \beta^{\lambda(y, \xi(\alpha))} P(d\alpha).$$

Proof. By Lemma 1 for any $\xi(\alpha)$ we can find $\theta(\alpha) \in \mathcal{P}(\Sigma)$ such that the loss of
experts does not increase: $\lambda(y, \theta(\alpha)) \le \lambda(y, \xi(\alpha))$ for any $y \in \mathcal{R}(\Sigma)$. Thus we have

$$\log_\beta \int_\Theta \beta^{\lambda(y, \theta(\alpha))} P(d\alpha) \le \log_\beta \int_\Theta \beta^{\lambda(y, \xi(\alpha))} P(d\alpha)$$

for any $y \in \mathcal{R}(\Sigma)$. We can take the same prediction $\gamma \in \Gamma$ that satisfies the
necessary inequality with $\theta$ instead of $\xi$. By Theorem 1 in Vovk and Zhdanov
(2008) such a prediction exists for any $\eta \in (0, 1]$ ($\beta \in [e^{-1}, 1)$). □

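A way to convert the generalised prediction into the prediction of AA is
called a substitution function. We prove that we can use the same substitution
function and the same learning rate parameter $\eta$ as for the case of a finite number
of possible outcomes. Such a function is proposed in Vovk and Zhdanov (2008).
This is an extension of Lemma 4.1 from Haussler et al. (1998).

Lemma 3. Let $P(d\alpha)$ be a probability distribution on $\Theta$ and put

$$f(y) = \log_\beta \int_\Theta \beta^{\lambda(y, \xi(\alpha))} P(d\alpha)$$

for every $y \in \Omega$. Then if $\gamma$ is such a prediction that $\lambda(z, \gamma) \le f(z)$ for any
$z \in \mathcal{R}(\Sigma)$ then $\lambda(y, \gamma) \le f(y)$ for any $y \in \Omega$.

Proof. For typographical reasons we will write $\xi$ instead of $\xi(\alpha)$. It is easy
to check that $\lambda(y, \gamma) - \lambda(y, \xi) = \sum_{\sigma \in \Sigma} y\{\sigma\} [\lambda(z_\sigma, \gamma) - \lambda(z_\sigma, \xi)]$ for $z_\sigma\{\rho\} = 0$
if $\sigma \ne \rho$ and $z_\sigma\{\rho\} = 1$ if $\sigma = \rho$. We also have that $\lambda(y, \gamma) - f(y) \le 0$ is
equivalent to $\int_\Theta \beta^{\lambda(y, \xi) - \lambda(y, \gamma)} P(d\alpha) \le 1$. Thus, due to the convexity of the
exponent function,

$$\int_\Theta \beta^{\sum_{\sigma \in \Sigma} y\{\sigma\} [\lambda(z_\sigma, \xi) - \lambda(z_\sigma, \gamma)]} P(d\alpha) \le \sum_{\sigma \in \Sigma} y\{\sigma\} \int_\Theta \beta^{\lambda(z_\sigma, \xi) - \lambda(z_\sigma, \gamma)} P(d\alpha) \le 1. \qquad \square$$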

Let us denote the $i$-th possible outcome from $\mathcal{R}(\Sigma)$ by $y\{i\}$, $i = 1, \dots, d$. We
use the substitution function defined by the following proposition:

Proposition 1. Let $r_i = g(y\{i\})$, and $x^+ = \max(x, 0)$. Define $s \in \mathbb{R}$ by the
requirement

$$\sum_{i=1}^d (s - r_i)^+ = 2.$$

If the prediction of the Aggregating Algorithm is given by

$$\gamma^i = (s - r_i)^+ / 2, \quad i = 1, \dots, d,$$

then (4) holds.

This function allows us to avoid weights normalization in calculating the
generalized prediction at each step (avoid $*$ in the weights distribution), which
would be computationally inefficient. Suppose we can get only $r = g_t(y) + C$
instead of $g_t(y)$, where $C$ is the same for all $y$. Then predictions $\gamma_t$ defined by
the substitution function from Proposition 1 will be the same as if we calculated
the generalized prediction with weights normalization.
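The substitution function of Proposition 1 amounts to solving a piecewise-linear equation in $s$, which can be done exactly by sorting. The sketch below is one possible way to do it (the name `substitution` and the sorting strategy are choices made for this illustration, not taken from the paper).

```python
def substitution(r):
    """Solve sum_i max(s - r_i, 0) = 2 for s and return gamma_i = max(s - r_i, 0) / 2."""
    rs = sorted(r)
    total = 0.0
    s = 0.0
    for k, rk in enumerate(rs, start=1):
        total += rk
        # Candidate s assuming exactly the k smallest r_i lie below s.
        s = (2.0 + total) / k
        if k == len(rs) or s <= rs[k]:
            break
    return [max(s - ri, 0.0) / 2.0 for ri in r]


uniform = substitution([0.0, 0.0, 0.0])
peaked = substitution([0.0, 5.0, 5.0])
shifted = substitution([7.0, 12.0, 12.0])   # the same r shifted by a constant C = 7
```

The output always sums to one, and adding the same constant to every $r_i$ leaves it unchanged, which is the invariance used in the paragraph above.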

3.2 Algorithm for multidimensional outcomes 

We set the prior weights distribution $P_0$ over the set $\Theta = \mathbb{R}^{n(d-1)}$ of experts $\alpha$
to have the Gaussian density with a parameter $a > 0$:

$$(a\eta/\pi)^{n(d-1)/2} e^{-a\eta\|\alpha\|^2} d\alpha.$$

Instead of taking the integral in (3) we get a shifted generalised prediction $r$ by
calculating $r_i = g_T(y\{i\}) - g_T(y\{d\})$ (we omit the index $T$ in $r$ for brevity).
Each component of $r = (r_1, \dots, r_d)$ corresponds to one of the possible outcomes,
so $r_d = 0$. Other components, $i = 1, \dots, d-1$:

$$r_i = \log_\beta \frac{\int_\Theta e^{-\eta Q(\alpha, y\{i\})} d\alpha}{\int_\Theta e^{-\eta Q(\alpha, y\{d\})} d\alpha},$$

where by $Q(\alpha, y)$ we denote the quadratic form:

$$Q(\alpha, y) = a\|\alpha\|^2 + \sum_{t=1}^T \sum_{i=1}^d (y_t^i - \xi_t^i(\alpha))^2.$$

Here $y_t = (y_t^1, \dots, y_t^d)$ are the outcomes on the steps before $T$ and $y_T = y =
(y_T^1, \dots, y_T^d)$ is a possible outcome on the step $T$.

Let $C = \sum_{t=1}^T x_t x_t'$ be an $n \times n$ matrix. The quadratic form $Q$ can be divided
into a quadratic part, a linear part, and a remainder: $Q = Q_1 + Q_2 + Q_3$. Here

$$Q_1(\alpha, y) = \alpha' A \alpha$$

is the quadratic part of $Q(\alpha, y)$. Here $A$ is a square matrix with $n(d-1)$ rows
(see the expression for $A$ in the algorithm below). The linear part is equal to

$$Q_2(\alpha, y) = h'\alpha - 2 \sum_{i=1}^{d-1} (y_T^i - y_T^d) \alpha_i' x_T,$$

where $h_i = -2 \sum_{t=1}^{T-1} (y_t^i - y_t^d) x_t$, $i = 1, \dots, d-1$ make up a big vector $h =
(h_1', \dots, h_{d-1}')'$. The remainder is equal to

$$Q_3(\alpha, y) = \sum_{t=1}^{T-1} \sum_{i=1}^d (y_t^i - 1/d)^2 + \sum_{i=1}^d (y_T^i - 1/d)^2.$$

The ratio for $r_i$ can be calculated using the following lemmas. The integral
evaluates as follows:

Lemma 4. Let $Q(\alpha) = \alpha' A \alpha + b'\alpha + c$, where $\alpha, b \in \mathbb{R}^n$, $c$ is a scalar and $A$ is
a symmetric positive definite $n \times n$ matrix. Then

$$\int_{\mathbb{R}^n} e^{-Q(\alpha)} d\alpha = e^{-Q_0} \frac{\pi^{n/2}}{\sqrt{\det A}},$$

where $Q_0 = \min_{\alpha \in \mathbb{R}^n} Q(\alpha)$.

The proof of this lemma can be found in Harville (1997, Theorem 15.12.1).
Following this lemma, we can rewrite $r_i$ as $r_i = F(A, b_i, z_i)$, $i = 1, \dots, d-1$,
where

$$F(A, b_i, z_i) = \min_\alpha Q(\alpha, y\{i\}) - \min_\alpha Q(\alpha, y\{d\}).$$

Variables $b_i$, $z_i$ and the precise formula for $F$ are defined by the following lemma.

Lemma 5. Let

$$F(A, b, z) = \min_{\alpha \in \mathbb{R}^n} (\alpha' A \alpha + b'\alpha + z'\alpha) - \min_{\alpha \in \mathbb{R}^n} (\alpha' A \alpha + b'\alpha - z'\alpha),$$

where $b, z \in \mathbb{R}^n$ and $A$ is a symmetric positive definite $n \times n$ matrix. Then
$F(A, b, z) = -b' A^{-1} z$.

Proof. This lemma is proven by taking the derivative of the quadratic forms
with respect to $\alpha$ and calculating the minimum: $\min_{\alpha \in \mathbb{R}^n} (\alpha' A \alpha + c'\alpha) = -\frac{c' A^{-1} c}{4}$ for
any $c \in \mathbb{R}^n$ (see Harville, 1997, Theorem 19.1.1). □
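The computation behind Lemma 5 is short enough to spell out. The following worked derivation uses only the symmetry and positive definiteness of $A$ assumed in the lemma:

```latex
\begin{align*}
\nabla_\alpha\,(\alpha' A \alpha + c'\alpha) = 2A\alpha + c = 0
  \;&\Longrightarrow\; \alpha^* = -\tfrac{1}{2} A^{-1} c, \\
\min_{\alpha}\,(\alpha' A \alpha + c'\alpha)
  &= \tfrac{1}{4}\, c' A^{-1} c - \tfrac{1}{2}\, c' A^{-1} c
   = -\tfrac{1}{4}\, c' A^{-1} c, \\
F(A, b, z)
  &= -\tfrac{1}{4}\,(b+z)' A^{-1} (b+z) + \tfrac{1}{4}\,(b-z)' A^{-1} (b-z)
   = -\,b' A^{-1} z .
\end{align*}
```

The last equality uses $b' A^{-1} z = z' A^{-1} b$, so the squared terms cancel and only the cross terms remain.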



We can see that $b_i = h + (x_T', \dots, x_T', 0', x_T', \dots, x_T')' \in \mathbb{R}^{n(d-1)}$, where $0$ is a
zero-vector from $\mathbb{R}^n$ placed at the $i$-th position. We also have $z_i = (-x_T', \dots, -x_T', -2x_T', -x_T', \dots, -x_T')'$,
where $-2x_T'$ is placed at the $i$-th position.
Thus we can calculate $d-1$ differences $r_i$, assign $r_d = 0$, and then apply
the substitution function from Proposition 1 to get predictions. The resulting
algorithm is Algorithm 1. We will further call it mAAR (multi-dimensional
Aggregating Algorithm for Regression).

3.3 Component-wise algorithm 

In this section we derive the component-wise algorithm. It gives predictions for 
each component of the outcome separately, and then combines them in a special 
way. 

First we explain why we should not directly use the algorithm and the
theoretical bound proposed in Vovk (2001). Vovk's experts do not allow us to take
advantage of the fact that only one outcome can happen at each
moment. They are more suitable for the case when each input vector $x$ can belong
to many classes simultaneously in case of classification. In other words, they are
centered around the center 1/2 of the prediction interval $[0, 1]$: $\xi_t = 1/2 + \alpha' x_t$.
Assume that the number of outcomes is very large and the distribution on
experts is normal $N(0, \sigma^2)$ with small $\sigma$. Then the average experts' prediction is
$(1/2, \dots, 1/2, 1 - (d-1)/2)$, and the average loss of the experts on trials with the
same outcome $y = y\{i\}$ (we can take $y = (1, 0, \dots, 0)$) is $(d-1)/2^2 + (d-1)^2/2^2$.
Algorithm 1 mAAR for the Brier game
  Fix $n$, $a > 0$. $C = 0$, $h = 0$.
  for $t = 1, 2, \dots$ do
    Read new $x_t \in X$.
    $C = C + x_t x_t'$, $A = aI + \begin{pmatrix} 2C & \cdots & C \\ \vdots & \ddots & \vdots \\ C & \cdots & 2C \end{pmatrix}$.
    Set $b_i = h + (x_t', \dots, x_t', 0', x_t', \dots, x_t')'$, where $0$ is a zero-vector from $\mathbb{R}^n$
      placed at the $i$-th position, $i = 1, \dots, d-1$.
    Set $z_i = (-x_t', \dots, -x_t', -2x_t', -x_t', \dots, -x_t')'$, where $-2x_t'$ is placed at the $i$-th
      position, $i = 1, \dots, d-1$.
    Calculate $r_i := -b_i' A^{-1} z_i$, $r_d := 0$, $i = 1, \dots, d-1$.
    Solve $\sum_{i=1}^d (s - r_i)^+ = 2$ in $s \in \mathbb{R}$.
    Set $\gamma_t^i := (s - r_i)^+/2$, $i = 1, \dots, d$.
    Output prediction $\gamma_t \in \mathcal{P}(\Sigma)$.
    Read observation $y_t$.
    $h_i = h_i - 2(y_t^i - y_t^d) x_t$, $h = (h_1', \dots, h_{d-1}')'$.
  end for
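The linear-algebra part of mAAR can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' Matlab code: the class name `MAAR`, the helper `solve` (Gaussian elimination), and the method names are assumptions of this sketch, and it stops at the shifted generalised predictions $r_i$, to which the substitution function of Proposition 1 is then applied.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


class MAAR:
    """State C (n x n) and h (d-1 blocks of length n) of the mAAR recursion."""

    def __init__(self, n, d, a):
        self.n, self.d, self.a = n, d, a
        self.C = [[0.0] * n for _ in range(n)]
        self.h = [[0.0] * n for _ in range(d - 1)]

    def shifted_predictions(self, x):
        """Read x_t (call once per step) and return r = (r_1, ..., r_{d-1}, 0)."""
        n, d = self.n, self.d
        m = n * (d - 1)
        for p in range(n):
            for q in range(n):
                self.C[p][q] += x[p] * x[q]
        # A = aI + block matrix with 2C on the diagonal and C elsewhere.
        A = [[0.0] * m for _ in range(m)]
        for bi in range(d - 1):
            for bj in range(d - 1):
                scale = 2.0 if bi == bj else 1.0
                for p in range(n):
                    for q in range(n):
                        A[bi * n + p][bj * n + q] = scale * self.C[p][q]
        for k in range(m):
            A[k][k] += self.a
        r = []
        for i in range(d - 1):
            b, z = [], []
            for j in range(d - 1):
                for p in range(n):
                    b.append(self.h[j][p] + (0.0 if j == i else x[p]))
                    z.append(-2.0 * x[p] if j == i else -x[p])
            u = solve(A, z)                      # u = A^{-1} z_i
            r.append(-sum(bk * uk for bk, uk in zip(b, u)))
        r.append(0.0)                            # r_d := 0
        return r

    def update(self, x, y):
        """Read the observation y_t and update h_i = h_i - 2 (y^i - y^d) x_t."""
        for i in range(self.d - 1):
            for p in range(self.n):
                self.h[i][p] -= 2.0 * (y[i] - y[self.d - 1]) * x[p]


model = MAAR(n=1, d=3, a=1.0)
r_first = model.shifted_predictions([1.0])
model.update([1.0], (1.0, 0.0, 0.0))
r_second = model.shifted_predictions([1.0])
```

Rebuilding and solving with $A$ at every step costs on the order of $(n(d-1))^3$ operations; the sketch favours readability over speed.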



Components of experts (1) concentrate around the point $1/d$, and so experts
have the average loss $(d-1)/d^2 + (1 - 1/d)^2$. This loss is smaller than the loss
of Vovk's experts for large values of $d$.

Our component-wise experts are expressed by

$$\xi_t^i = 1/d + \alpha_i' x_t, \quad i = 1, \dots, d. \qquad (5)$$

The derivation of the component-wise algorithm (further cAAR stands for
component-wise Aggregating Algorithm Regression) is similar to the derivation of
Algorithm 1 for two outcomes. The initial distribution on each component of
experts (5) is given by

$$(a\eta/\pi)^{n/2} e^{-a\eta\|\alpha_i\|^2} d\alpha_i.$$

Note that the value for $\eta$ here will be different from 1 since the loss function by
each component is half of the Brier loss $\lambda(y, \gamma) = (y - \gamma)^2 + (1 - y - (1 - \gamma))^2$.
We will further see that $\eta = 2$. The loss of expert $\xi(\alpha_i)$ over the first $T$ trials is

$$\sum_{t=1}^T (y_t^i - 1/d - \alpha_i' x_t)^2 = \alpha_i' \Big(\sum_{t=1}^T x_t x_t'\Big) \alpha_i - 2\alpha_i' \Big(\sum_{t=1}^T (y_t^i - 1/d) x_t\Big) + \sum_{t=1}^T (y_t^i - 1/d)^2.$$

Instead of the substitution function from Proposition 1 we use the substitution
function suggested in Vovk (2001) for the one-dimensional game:

$$\gamma_T^i = \frac{1}{2} + \frac{g_T(0) - g_T(1)}{2}.$$

Therefore, the substitution function can be represented as

$$\begin{aligned}
\gamma_T^i &= \frac{1}{2} + \frac{1}{2} \log_\beta \frac{\beta^{g_T(0)}}{\beta^{g_T(1)}} \\
&= \frac{1}{2} + \frac{1}{2} \log_\beta \frac{\int_{\mathbb{R}^n} e^{-\eta \alpha_i' B \alpha_i + 2\eta \alpha_i' (E + (0 - 1/d) x_T) - \eta (W + 1/d^2)} d\alpha_i}{\int_{\mathbb{R}^n} e^{-\eta \alpha_i' B \alpha_i + 2\eta \alpha_i' (E + (1 - 1/d) x_T) - \eta (W + (1 - 1/d)^2)} d\alpha_i} \\
&= \frac{1}{d} + \Big(E + \frac{d-2}{2d} x_T\Big)' B^{-1} x_T
\end{aligned} \qquad (6)$$

for $i = 1, \dots, d$. Here $B = aI + \sum_{t=1}^T x_t x_t'$, $E = \sum_{t=1}^{T-1} (y_t^i - 1/d) x_t$, $W =
\sum_{t=1}^{T-1} (y_t^i - 1/d)^2$, $\beta = e^{-\eta}$. The transitions are justified using Lemma 4 and
Lemma 5.
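The step from the ratio of integrals to the closed form of the component-wise prediction can be checked by hand. The derivation below is a sketch under the notation of this section, writing $v_y = E + (y - 1/d)\,x_T$ for $y \in \{0, 1\}$ (a shorthand introduced only for this calculation); the determinants from Lemma 4 cancel in the ratio, so only the minima of the quadratic forms remain.

```latex
\begin{align*}
g_T(0) - g_T(1)
  &= \bigl(-v_0' B^{-1} v_0 + \tfrac{1}{d^2}\bigr)
   - \bigl(-v_1' B^{-1} v_1 + (1 - \tfrac{1}{d})^2\bigr) \\
  &= (v_1 - v_0)' B^{-1} (v_1 + v_0) + \tfrac{2}{d} - 1
   = x_T' B^{-1} \bigl(2E + \tfrac{d-2}{d}\, x_T\bigr) + \tfrac{2}{d} - 1, \\
\gamma_T^i
  &= \tfrac{1}{2} + \tfrac{1}{2}\bigl(g_T(0) - g_T(1)\bigr)
   = \tfrac{1}{d} + \bigl(E + \tfrac{d-2}{2d}\, x_T\bigr)' B^{-1} x_T .
\end{align*}
```

The result does not depend on $\eta$ or $W$, which is why the prediction can be computed without tracking the cumulative squared terms.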

Then this method projects its prediction onto the prediction simplex such
that the loss does not increase. We use the projection algorithm suggested
in Michelot (1986).



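Algorithm 2 Projection of a point from $\mathbb{R}^d$ onto probability simplex.
  Initialize $I = \emptyset$, $x = 1 \in \mathbb{R}^d$.
  Let $\gamma_T$ be the prediction vector and $|I|$ the number of elements of the set $I$.
  while 1 do
    $\gamma_T = \gamma_T - \frac{\sum_{i=1}^d \gamma_T^i - 1}{d - |I|} x$;
    $\gamma_T^i = 0$ for all $i \in I$;
    If $\gamma_T^i \ge 0$ for all $i = 1, \dots, d$ then break;
    $I = I \cup \{i : \gamma_T^i < 0\}$;
    If $\gamma_T^i < 0$ for some $i$ then $\gamma_T^i = 0$;
  end while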
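A compact version of this projection, written in the spirit of Michelot's iterative scheme, is sketched below. The function name `project_to_simplex` and the bookkeeping with a set of fixed coordinates are choices of this illustration, not a transcription of the original code.

```python
def project_to_simplex(point):
    """Euclidean projection of a point onto the probability simplex."""
    gamma = list(point)
    d = len(gamma)
    fixed = set()                      # coordinates already clamped to zero
    while True:
        active = [i for i in range(d) if i not in fixed]
        # Project onto the hyperplane sum = 1 within the active coordinates.
        shift = (sum(gamma[i] for i in active) - 1.0) / len(active)
        for i in active:
            gamma[i] -= shift
        negative = [i for i in active if gamma[i] < 0.0]
        if not negative:
            return gamma
        for i in negative:             # clamp and remove from further steps
            gamma[i] = 0.0
        fixed.update(negative)


centre = project_to_simplex([0.5, 0.5, 0.5])
vertex = project_to_simplex([1.2, -0.1, 0.1])
```

After each hyperplane step the active coordinates sum to one, so at least one of them stays positive and the loop ends after at most $d$ passes.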



4 Theoretical bound 

We derive the theoretical bounds for the losses of Algorithm 1 and of a naive
component-wise algorithm predicting in the same framework.

4.1 Component-wise algorithm 

We prove here the theoretical bound for the loss of cAAR. The following lemma
is the main tool helping us to prove our theorems. It is easy to prove the
following statement (Lemma 1 from Vovk (2001)):

Lemma 6. If the learner follows the Aggregating Algorithm in a perfectly
mixable game, then for every positive integer $T$, every sequence of outcomes of the
length $T$, and any initial weights distribution on experts $P_0(d\alpha)$ it suffers loss
satisfying

$$L_T(\mathrm{AA}(\eta, P_0)) \le \log_\beta \int_\Theta \beta^{L_T(\alpha)} P_0(d\alpha). \qquad (7)$$

Proof. We proceed by induction in $T$; for $T = 0$ the inequality is obvious, and
for $T > 0$ we have:

$$\begin{aligned}
L_T(\mathrm{AA}(\eta, P_0)) &\le L_{T-1}(\mathrm{AA}(\eta, P_0)) + g_T(y_T) \\
&\le \log_\beta \int_\Theta \beta^{L_{T-1}(\alpha)} P_0(d\alpha) + \log_\beta \int_\Theta \beta^{\lambda(y_T, \xi_T(\alpha))} \frac{\beta^{L_{T-1}(\alpha)}}{\int_\Theta \beta^{L_{T-1}(\alpha)} P_0(d\alpha)} P_0(d\alpha) \\
&= \log_\beta \int_\Theta \beta^{L_T(\alpha)} P_0(d\alpha).
\end{aligned}$$

Here the second inequality follows from the inductive assumption, the
definition (3) of $g_T$, and the weight update (2). □

The loss of the component-wise algorithm by one component is bounded as
in the following theorem.

Theorem 1. Let the outcome space in the prediction game be $[A, B]$, $A, B \in \mathbb{R}$.
Assume experts' predictions at each step are $\xi_t = C + \alpha' x_t$, where $\alpha \in \mathbb{R}^n$,
$C \in \mathbb{R}$ is the same for all the experts $\alpha$, and $\|x_t\|_\infty \le X$, $\forall t$. There exists a
prediction algorithm producing predictions $\gamma_t \in \mathbb{R}$ such that for any $a > 0$,
every positive integer $T$, every sequence of input vectors and outcomes of the
length $T$ and any $\alpha \in \mathbb{R}^n$ we have

$$\sum_{t=1}^T (\gamma_t - y_t)^2 \le \sum_{t=1}^T (\xi_t - y_t)^2 + a\|\alpha\|_2^2 + \frac{n(B-A)^2}{4} \ln\left(\frac{TX^2}{a} + 1\right). \qquad (8)$$

Proof. We need to prove that the game is perfectly mixable (see (3)) and find
the optimal parameter $\eta$ for the algorithm. Implications similar to the ones in
the proof of Lemma 2 from Vovk (2001) lead to the inequality $\eta \le \frac{2}{(B-A)^2}$.
Clearly, Lemma 6 holds for our case, so we need only to calculate the difference
between the right-hand side of (7)

$$\log_\beta \int_{\mathbb{R}^n} (a\eta/\pi)^{n/2} \exp\left(-\eta \alpha' \Big(aI + \sum_{t=1}^T x_t x_t'\Big) \alpha + 2\eta \alpha' \sum_{t=1}^T (y_t - C) x_t - \eta \sum_{t=1}^T (y_t - C)^2\right) d\alpha$$

and the loss of the best expert $\alpha_0' \big(aI + \sum_{t=1}^T x_t x_t'\big) \alpha_0 - 2\alpha_0' \big(\sum_{t=1}^T (y_t - C) x_t\big) +
\sum_{t=1}^T (y_t - C)^2$. Here $\alpha_0$ is the point where the minimum of the quadratic form
is attained. Then due to Lemma 4 this difference will be equal to

$$\frac{1}{2\eta} \ln \det\left(I + \frac{1}{a} \sum_{t=1}^T x_t x_t'\right) \le \frac{n}{2\eta} \ln\left(\frac{TX^2}{a} + 1\right).$$

We bound the determinant of a symmetric positive definite matrix by the
product of its diagonal elements (see Beckenbach and Bellman (1961), Chapter 2,
Theorem 7) and use $\eta = \frac{2}{(B-A)^2}$. □

Interestingly, the theoretical bound for the regression algorithm depends only
on the size of the prediction interval but not on its location. It also does
not depend on the concentration point of experts. We use the component-wise
algorithm to predict each component separately.

Theorem 2. If $\|x_t\|_\infty \le X$, $\forall t$, then for any $a > 0$, every positive integer $T$,
every sequence of outcomes of the length $T$, and any $\alpha \in \mathbb{R}^{n(d-1)}$ the loss $L_T$ of
the component-wise algorithm satisfies

$$L_T \le L_T(\alpha) + da\|\alpha\|_2^2 + \frac{nd}{4} \ln\left(\frac{TX^2}{a} + 1\right). \qquad (9)$$

Proof. We extend the class of experts (1) to the class (5). The algorithm predicts
each component of the outcome separately. Summing the theoretical bounds (8) for
$d$ components of the outcome, taking $\alpha_d = -\sum_{i=1}^{d-1} \alpha_i$, and using the Cauchy
inequality $\|\sum_{i=1}^{d-1} \alpha_i\|_2^2 \le (d-1) \sum_{i=1}^{d-1} \|\alpha_i\|_2^2$, we get the bound. To give
probability forecasts we can project prediction points on the prediction simplex using
Algorithm 2. The bound will then hold by Lemma 1. □

4.2 Linear forecasting 

The theoretical bound for the loss of Algorithm 1 is given by the following theorem.

Theorem 3. If $\|x_t\|_\infty \le X$, $\forall t$, then for any $a > 0$, every positive integer $T$,
every sequence of outcomes of the length $T$, and any $\alpha \in \mathbb{R}^{n(d-1)}$, mAAR($2a$)
satisfies

$$L_T(\mathrm{mAAR}(2a)) \le L_T(\alpha) + 2a\|\alpha\|_2^2 + \frac{n(d-1)}{2} \ln\left(\frac{TX^2}{a} + 1\right). \qquad (10)$$

Proof. We apply mAAR with the parameter $b = 2a$. Recall that $C = \sum_{t=1}^T x_t x_t'$.
Following the line of the proof of Theorem 1 with $\eta = 1$ we get the theoretical
bound. □

We can derive a slightly better theoretical bound: in the determinant of $A$
one should subtract the second block row from the first one and then add the
first block column to the second one, then repeat this $d - 2$ times.
Proposition 2. In the conditions of Theorem 3, mAAR($a$) satisfies

$$L_T(\mathrm{mAAR}(a)) \le L_T(\alpha) + a\|\alpha\|_2^2 + \dots$$

The theoretical bound (10) is worse asymptotically by $d$ than the bound (9)
of the component-wise algorithm, but it is better in the beginning, especially
when the norm of the best expert $\|\alpha\|$ is large. This can happen in the
important case when the dimension of the input vector is larger than the size of the
prediction set: $n \gg T$.
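The trade-off between the two bounds can be explored numerically. The sketch below assumes the regret terms read as $da\|\alpha\|_2^2 + \frac{nd}{4}\ln(TX^2/a + 1)$ for the component-wise algorithm and $2a\|\alpha\|_2^2 + \frac{n(d-1)}{2}\ln(TX^2/a + 1)$ for mAAR($2a$); the function names and the sample parameter values are hypothetical.

```python
import math


def caar_regret_bound(norm_sq, n, d, a, T, X=1.0):
    """Regret term of the component-wise bound (assumed form, see lead-in)."""
    return d * a * norm_sq + n * d / 4.0 * math.log(T * X ** 2 / a + 1.0)


def maar_regret_bound(norm_sq, n, d, a, T, X=1.0):
    """Regret term of the mAAR(2a) bound (assumed form, see lead-in)."""
    return 2.0 * a * norm_sq + n * (d - 1) / 2.0 * math.log(T * X ** 2 / a + 1.0)


# Large expert norm, few steps: the mAAR term is smaller.
early = (maar_regret_bound(100.0, 10, 3, 1.0, 10),
         caar_regret_bound(100.0, 10, 3, 1.0, 10))

# Small expert norm, many steps: the component-wise term is smaller.
late = (maar_regret_bound(1.0, 10, 3, 1.0, 1000),
        caar_regret_bound(1.0, 10, 3, 1.0, 1000))
```

With these assumed forms the crossover point moves to larger $T$ as $\|\alpha\|_2^2$ grows, since the norm term scales with $d$ in one bound and with 2 in the other.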



5 Kernelization 

In some cases the linear model can be considered not rich enough to describe
data well, and a more complicated model is needed. We use the kernel trick,
popular in computer learning and first applied to the AAR in Gammerman et al.
(2004). We derive an algorithm competing with all sets of functions from an
RKHS with $d - 1$ elements.



5.1 Derivation of the algorithm 

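Definition 1. A kernel function is a symmetric nonnegative definite
function $K : X \times X \to \mathbb{R}$, that is, a symmetric function satisfying
$\sum_{i,j=1}^n \xi_i \xi_j K(x_i, x_j) \ge 0$ for all positive
integers $n$, all $x_1, \dots, x_n \in X$, and $\xi_1, \dots, \xi_n \in \mathbb{R}$.

An RKHS contains all linear regressors $\langle \Phi(\cdot), h \rangle_H$ defined by means of a
feature map $\Phi$ (for all the definitions see Schölkopf and Smola, 2002). It can also
be defined in a different equivalent way as a functional Hilbert space with a
continuous evaluation functional $\varphi : f \in \mathcal{F} \mapsto f(x)$ for each $x \in X$. We will use
the notation $c_{\mathcal{F}}(x)$ for the norm of this functional: $c_{\mathcal{F}}(x) := \sup_{f : \|f\|_{\mathcal{F}} \le 1} |f(x)|$
and for the embedding constant $c_{\mathcal{F}} := \sup_{x \in X} c_{\mathcal{F}}(x)$ and assume $c_{\mathcal{F}} < \infty$.
Our algorithm competes with the following experts:

$$\begin{aligned}
\xi_t^1 &= 1/d + f_1(x_t) \\
&\;\;\vdots \\
\xi_t^{d-1} &= 1/d + f_{d-1}(x_t) \\
\xi_t^d &= 1 - \xi_t^1 - \dots - \xi_t^{d-1}
\end{aligned} \qquad (12)$$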

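Here $f_1, \dots, f_{d-1} \in \mathcal{F}$ are any functions from some RKHS $\mathcal{F}$. We start by
rewriting mAAR in the dual form. Denote

$$\tilde{Y}_i = -2(y_1^i - y_1^d, \dots, y_{T-1}^i - y_{T-1}^d, -1/2)',$$

$$Y_i = -2(y_1^i - y_1^d, \dots, y_{T-1}^i - y_{T-1}^d, 0)',$$

$$k(x_T) = (x_1' x_T, \dots, x_T' x_T)',$$

$$K = (x_s' x_t)_{s,t} \text{ is the matrix of scalar products}$$

for $i = 1, \dots, d-1$, $s, t = 1, \dots, T$. We show that the predictions of mAAR can
be represented in terms of the variables defined above. We will need the following
matrix property.

Proposition 3. Let $B$, $C$ be matrices such that the number of rows in $B$ equals
the number of columns in $C$, and let $I$ denote identity matrices of the appropriate
sizes. If $aI + CB$ and $aI + BC$ are nonsingular then

$$B(aI + CB)^{-1} = (aI + BC)^{-1} B. \qquad (13)$$

Proof. This is equivalent to $(aI + BC)B = B(aI + CB)$. That is true because
of the distributivity of matrix multiplication. □

Let us set

$$\bar{A} = aI + \begin{pmatrix} 2K & \cdots & K \\ \vdots & \ddots & \vdots \\ K & \cdots & 2K \end{pmatrix}.$$

Lemma 7. On trial $T$ the values $r_i$ for $i = 1, \dots, d-1$ in mAAR can be represented
as

$$r_i = \big(\tilde{Y}_1' \cdots Y_i' \cdots \tilde{Y}_{d-1}'\big) \bar{A}^{-1} \big(k(x_T)' \cdots 2k(x_T)' \cdots k(x_T)'\big)'. \qquad (14)$$

Proof. By $M = (x_1, \dots, x_T)$ denote the $n \times T$ matrix of column input vectors.
Let us set

$$B = \begin{pmatrix} 2M & \cdots & M \\ \vdots & \ddots & \vdots \\ M & \cdots & 2M \end{pmatrix}, \qquad C = \begin{pmatrix} M' & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & M' \end{pmatrix}.$$

Then $h_i$ from the algorithm mAAR equals $h_i = M Y_i \in \mathbb{R}^n$. Decompose $b_i' =
\big(\tilde{Y}_1' \cdots Y_i' \cdots \tilde{Y}_{d-1}'\big) C$, with $Y_i$ in the $i$-th position. The matrix
$A$ is equal to $A = aI + BC$. Using Proposition 3,

$$r_i = -b_i' A^{-1} z_i = -\big(\tilde{Y}_1' \cdots Y_i' \cdots \tilde{Y}_{d-1}'\big) (aI + CB)^{-1} C \big(-x_T' \cdots -2x_T' \cdots -x_T'\big)'.$$

Note that $K = M'M$ and $k(x_T) = M' x_T$, thus (14) holds. □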

Instead of the dot product in $k(x_T)$ and $K$ we can choose a different kernel
(classical examples of kernels are Gaussian (RBF): $K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}$, Vapnik's
polynomial $K(x_i, x_j) = (x_i \cdot x_j + 1)^p$, etc.). To get predictions one can use the
same substitution function from Proposition 1. We call this algorithm mKAAR
(K for Kernelized).
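The two objects that the dual form needs from a kernel, the matrix $K$ of kernel values and the vector $k(x_T)$, can be computed as in the sketch below. The parametrisation of the Gaussian kernel by `sigma`, the polynomial `degree`, and all function names are assumptions of this illustration.

```python
import math


def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel; the bandwidth convention is an assumption."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def polynomial_kernel(u, v, degree=2):
    """Polynomial kernel (u . v + 1) ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + 1.0) ** degree


def gram_matrix(points, kernel):
    """Matrix K of kernel values K(x_s, x_t) for all pairs of inputs."""
    return [[kernel(p, q) for q in points] for p in points]


def kernel_vector(points, x, kernel):
    """Vector k(x) = (K(x_1, x), ..., K(x_T, x))."""
    return [kernel(p, x) for p in points]


inputs = [[0.0, 1.0], [1.0, 2.0], [3.0, 4.0]]
K = gram_matrix(inputs, gaussian_kernel)
k_last = kernel_vector(inputs, inputs[-1], polynomial_kernel)
```

Replacing `gaussian_kernel` by a plain dot product recovers the matrix of scalar products used in the linear case.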
5.2 Theoretical bound for the kernelized algorithm

To derive a theoretical bound for the loss of mKAAR we will use the following
matrix determinant identity lemma.

Lemma 8 (Matrix determinant identity). Let $B$, $C$ be as in Proposition 3
and let $a$ be a real number. Then $\det(aI + BC) = \det(aI + CB)$.

Proof. The proof is by considering a block matrix identity. □

The main theorem follows from the property of RKHS called the Representer
theorem (see Schölkopf and Smola, 2002, Theorem 4.2).

Theorem 4 (Representer theorem). Denote by $g : [0, \infty) \to \mathbb{R}$ a strictly
monotonic increasing function. Assume $X$ is an arbitrary set, and $\mathcal{F}$ is a
Reproducing Kernel Hilbert Space of functions on $X$ with the given kernel $K : X^2 \to
\mathbb{R}$. Assume we also have a positive integer $T$ and an arbitrary loss function
$c : (X \times \mathbb{R}^2)^T \to \mathbb{R} \cup \{\infty\}$. Then each minimizer $f \in \mathcal{F}$ of

$$c\big((x_1, y_1, f(x_1)), \dots, (x_T, y_T, f(x_T))\big) + g(\|f\|_{\mathcal{F}})$$

admits a representation of the form $f(x) = \sum_{i=1}^T \alpha_i K(x_i, x)$ for any $x \in X$ and
reals $\alpha_i$, $i = 1, \dots, T$.

The theoretical bound for the loss of mKAAR is proven in the following
theorem.

Theorem 5. Assume $X$ is an arbitrary set of inputs and $\mathcal{F}$ is a Reproducing
Kernel Hilbert Space of functions on $X$ with the given kernel $K : X^2 \to \mathbb{R}$. Then
for any $a > 0$, any $f_1, \dots, f_{d-1} \in \mathcal{F}$, any positive integer $T$, and any sequence
of inputs and outputs $(x_1, y_1), \dots, (x_T, y_T)$

$$L_T(\mathrm{mKAAR}) \le L_T(f) + a \sum_{i=1}^{d-1} \|f_i\|_{\mathcal{F}}^2 + \frac{1}{2} \ln \det \bar{A}. \qquad (15)$$

Here the matrix $K$ is the matrix of kernel values $K(x_i, x_j)$, $i, j = 1, \dots, T$.

Proof. The bound follows from Theorem 3 for mAAR and the Representer
theorem. Let us first consider the case with the scalar product kernel. Denote
$C = \sum_{t=1}^T x_t x_t'$. By Lemma 8 and calculations similar to the ones in the proof of
Lemma 7 we have the equality of determinants. So we can use any other kernel
instead of the scalar product to get the term with the determinant. The
Representer theorem assures that the minimum of the expression $L_T(f) + a \sum_{i=1}^{d-1} \|f_i\|_{\mathcal{F}}^2$
over the functions $f_i$ is reached on a linear regressor. □



We can represent the bound (15) in another form which is more familiar
from the on-line prediction literature:

Corollary 1. Under the assumptions of Theorem 5, and if we know the number of
steps $T$ in advance and are given $F > 0$, the mKAAR reaches the performance

$$L_T(\mathrm{mKAAR}) \le L_T(f) + 2 c_{\mathcal{F}} F \sqrt{(d-1)T}, \qquad (16)$$

for any $f_1, \dots, f_{d-1} \in \mathcal{F}$ such that $\sum_{i=1}^{d-1} \|f_i\|_{\mathcal{F}}^2 \le F^2$.

Proof. Bounding the logarithm of the determinant we have In det A < {d — 
1)T\tl{i + ^^ .We can choose the value for a where the minimum is achieved: 

a= '-^^ . U 



6 Experiments 

We run our algorithms on six real-world time-series data sets. In the time series
we consider there are no signals attached to the outcomes. However, we can take
vectors consisting of previous observations (we shall take ten of those) and use
them as signals. Data set DEC-PKT¹ contains an hour's worth of all wide-area
traffic between Digital Equipment Corporation and the rest of the world. Data
set LBL-PKT-4¹ consists of observations of another hour of traffic between the
Lawrence Berkeley Laboratory and the rest of the world. We transformed both
the data sets in such a way that each observation is the number of packets in the
corresponding network during a fixed time interval of one second. The other four
data sets² (C4, C9, E5, E8) relate to transportation data. Two of them (C9, C11)
contain low-frequency monthly traffic measures. Two of them (E5, E8) contain
high-frequency daily traffic measures. On each of these data sets the following
operations were performed: subtraction of the mean value and division by the
maximum absolute value. The resulting time series are shown in Figure 1.

We used ten previous observations as an input vector for the tested algorithms
at each prediction step. We are solving the 3-class classification problem: we
predict whether the next value in a time series will be more than the previous
value plus a precision parameter $\epsilon$, less than the previous value minus $\epsilon$, or will lie in the $2\epsilon$ tube
around the previous value. The precision $\epsilon$ is chosen to be the median of all the
changes in a data set. In order to assess the quality of predictions, we calculate
the cumulative square loss on the last two thirds of each time series (test set)
and divide it by the number of examples (MSE). Since we are considering the
online setting, we could calculate the cumulative loss from the beginning of each
time series. However, our approach is not sensitive to starting effects, it allows
us to choose the ridge parameter $a$ fairly on the training set, and it allows us to
compare the performance of our algorithms with batch algorithms, which would
normally be used to solve this problem.
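The construction of the three classes from a raw series can be sketched as follows. Two points are assumptions of this illustration rather than statements of the paper: the median is taken over absolute changes, and the class order is (up, down, inside the tube); the function name `three_class_labels` is hypothetical.

```python
from statistics import median


def three_class_labels(series, eps=None):
    """Label each step as up / down / stay relative to the previous value."""
    changes = [b - a for a, b in zip(series, series[1:])]
    if eps is None:
        # Assumption: the precision is the median of the absolute changes.
        eps = median(abs(c) for c in changes)
    labels = []
    for c in changes:
        if c > eps:
            labels.append((1.0, 0.0, 0.0))    # more than previous value + eps
        elif c < -eps:
            labels.append((0.0, 1.0, 0.0))    # less than previous value - eps
        else:
            labels.append((0.0, 0.0, 1.0))    # inside the 2*eps tube
    return eps, labels


eps, labels = three_class_labels([0.0, 1.0, 1.0, -2.0, -1.5])
```

Each label is a vertex of the simplex, so it can be used directly as an outcome $y_t$ in the Brier game.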

Figure 1: Time series from 6 data sets.

¹ Data sets can be found at http://ita.ee.lbl.gov/html/traces.html.

² Data sets can be found at http://www.neural-forecasting-competition.com/index.htm.

The square loss on the test set takes into account the quality of an algorithm
only at the very end of the prediction process, and does not consider the quality
during the process. We introduce another quality measure: at each step in the
test set we calculate the MSE of an algorithm until this step. After all the steps we
average these MSEs (AMSE). Clearly, if one algorithm is better than another
on the whole test set (its total MSE is smaller) but was often worse on many
parts of the test set (total MSEs of many parts of the set are larger), this measure
takes it into account.
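Both quality measures are simple running averages over the per-step losses on the test set. A minimal sketch, with the hypothetical function name `mse_and_amse`:

```python
def mse_and_amse(losses):
    """Return (MSE, AMSE) for a sequence of per-step losses on the test set."""
    cumulative = 0.0      # cumulative loss up to the current step
    running_sum = 0.0     # sum of the MSEs computed after each step
    for step, loss in enumerate(losses, start=1):
        cumulative += loss
        running_sum += cumulative / step
    n = len(losses)
    return cumulative / n, running_sum / n


mse, amse = mse_and_amse([1.0, 0.0, 0.0, 1.0])
```

In this toy sequence the MSE is 0.5 while the AMSE is larger, because the early mistake weighs on every running average that follows it.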

We compare the performance of our algorithms with multinomial logistic
regression (mLog), because it is a standard classification algorithm which
gives probability predictions:

$$\gamma^{i}_{\mathrm{mLog}} = \frac{e^{\theta_i' x}}{\sum_{j=1}^{d} e^{\theta_j' x}}$$

for all the components of the outcome i = 1, ..., d. In our case d = 3. Here
the parameters θ_1, ..., θ_d are estimated from the training set. We apply this
algorithm in two regimes: the batch regime, where the algorithm learns only on
the training set and is tested on the test set (and thus θ is not updated on
the test set); and the online regime, where new parameters θ are found at each
step and only the next outcome is predicted. The second regime gives a fairer
comparison with online algorithms, but the first one is standard and faster.
In both regimes logistic regression does not have theoretical guarantees on
the square loss.
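
As a rough illustration of how the mLog prediction is formed once the parameters are known, the sketch below assumes the usual softmax form of multinomial logistic regression. The function name `mlog_predict` is hypothetical, and parameter estimation is not shown.

```python
import math


def mlog_predict(thetas, x):
    """Softmax prediction: component i is exp(theta_i . x) / sum_j exp(theta_j . x)."""
    scores = [sum(t_k * x_k for t_k, x_k in zip(theta, x)) for theta in thetas]
    # Subtracting the maximum score avoids overflow and leaves the ratios unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In the batch regime `thetas` would be fitted once on the training set. In the online regime it would be refitted before every call, which is consistent with the much longer running times reported for online logistic regression.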

We also compare our algorithms with a simple predictor that predicts the
average of the ten previous outcomes (and thus always gives probability
predictions).
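
A minimal sketch of this baseline follows, assuming the past outcomes are stored as one-hot vectors. The name `simple_predict` and the uniform prediction used when no history is available are assumptions made for illustration.

```python
def simple_predict(past_outcomes, window=10, d=3):
    """Average the last `window` outcomes component-wise."""
    recent = past_outcomes[-window:]
    if not recent:
        # Assumption: with no history, predict the uniform distribution.
        return [1.0 / d] * d
    return [sum(y[i] for y in recent) / len(recent) for i in range(d)]
```

Because each outcome is a vertex of the simplex, the average is a convex combination of vertices, so its components are non-negative and sum to one.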

We are not aware of other efficient algorithms for online probability
prediction, and thus use logistic regression and the simple predictor as the
only baselines. Component-wise algorithms which could be used for online
prediction (e.g., Gradient Descent, Kivinen and Warmuth, 1997; Ridge
Regression, Hoerl and Kennard, 2000) have to use normalization by Algorithm 2.
Thus they have to be applied in a different way than they are described in the
corresponding papers, and cannot be fairly compared with our algorithms.

The ridge for our algorithms is chosen to achieve the best MSE on the
training set: the first third of each series. The results are shown in Table 1.
We highlight the most precise algorithms for different data sets. We also show
the time needed to make predictions on the whole data set. The algorithms were
implemented in Matlab R2007b and run on a laptop with 2 GB RAM and an Intel
Core 2 T7200 2.00 GHz processor.

As we can see from the table, online methods perform better than the batch
method on most data sets. Online logistic regression performs well, but is
very slow. Our algorithms perform similarly to each other and comparably to
online logistic regression, but are much faster.



7 Discussion 

We considered an important generalization of the online classification
problem. We presented new algorithms which give probability predictions in the
Brier game. Neither algorithm involves any numerical integration, and both can
be easily computed. Both algorithms have theoretical guarantees on their
cumulative losses. One of the algorithms is kernelized, and a theoretical bound
is proven for the kernelized algorithm. We performed experiments with the
linear algorithms and showed that they perform relatively well. We compared
them with logistic regression, the benchmark algorithm giving probability
predictions.

Competing with linear experts in the case where possible outcomes lie in a
more than 2-dimensional simplex has not been widely considered by other
researchers, so a comparison of theoretical bounds cannot be performed. The
work of Kivinen and Warmuth (2001) includes the case when the possible
outcomes lie in a more than 2-dimensional simplex, and their algorithm
competes with all logistic regression functions. They use the relative entropy
loss function L and get a regret term of the order $O(\sqrt{L_T(\alpha)})$,
which is unbounded from above in the worst case. Their prediction algorithm is
not computationally efficient, and it is not clear how to extend their results
to the case when the predictors lie in an RKHS.

We can prove lower bounds for the regret term of the order
$O(\frac{d-1}{d} \ln T)$ for the case of the linear model (1) using methods
similar to the ones described in Vovk (2001), and lower bounds for the regret
term of the order $O(\sqrt{T})$ for the case of RKHS. Thus we can say that the
order of our bounds in the time step is optimal. The multiplicative constants
may possibly be improved, though.

Set       Algorithm     MSE       AMSE      Time
DEC-PKT   cAAR          0.45906   0.45822   0.578
          mAAR          0.45906   0.45822   1.25
          mLog          0.46107   0.46265   0.375
          mLog Online   0.45751   0.45762   2040.141
          Simple        0.58089   0.57883
LBL-PKT   cAAR          0.48147   0.479     0.579
          mAAR          0.48147   0.479     1.266
          mLog          0.47749   0.47482   0.391
          mLog Online   0.47598   0.47398   2403.562
          Simple        0.57087   0.5657    0.016
C4        cAAR          0.64834   0.65447   0.015
          mAAR          0.64538   0.65312   0.062
          mLog          0.76849   0.77797   0.016
          mLog Online   0.68164   0.7351    4.328
          Simple        0.69037   0.69813   0.016
C9        cAAR          0.63238   0.64082   0.015
          mAAR          0.63338   0.64055   0.063
          mLog          0.97718   0.91654   0.031
          mLog Online   0.71178   0.75558   10.625
          Simple        0.6509    0.65348
E5        cAAR          0.34452   0.34252   0.078
          mAAR          0.34453   0.34252   0.219
          mLog          0.31038   0.30737   1.109
          mLog Online   0.30646   0.30575   446.578
          Simple        0.58212   0.58225
E8        cAAR          0.29395   0.29276   0.078
          mAAR          0.29374   0.29223   0.25
          mLog          0.31316   0.30382   0.109
          mLog Online   0.27982   0.27068   83.125
          Simple        0.69691   0.70527   0.016


Table 1: The square losses and prediction time (sec) of different algorithms
applied to time series prediction. cAAR and mAAR stand for the derived
algorithms, mLog stands for logistic regression, mLog Online stands for online
logistic regression, and Simple stands for the simple average predictor.